Bioinformatics An Introduction 4th Edition (Jeremy Ramsden)

160

Useful Algorithms

13.2

Botryology

The term “botryology”, apparently coined by I. J. Good (1962), was introduced at

a time when the task of ﬁnding clusters was generally focused on objects arranged

in ordinary (Euclidean) space (e.g., stars clustered into galaxies). It signiﬁes a more

general approach to ﬁnding clusters or clumps, concerned with logical and qualita-

tive relationships, chosen for their relevance to the matter in hand, rather than with

ordinary distance (or a metric satisfying a triangle inequality; see below). Possibly

relevance could be deﬁned according to success in ﬁnding clusters or clumps, hence

permitting iterative reﬁnement of the deﬁnition.

Since then the notion of clustering has anyway been somewhat generalized,

and typically now includes any process whereby relevance can lead to a numeri-

cal attribute (e.g., the conditional probability of use of an object). The objects are

nodes on a graph (Sect. 12.2), and the links between them (edges) give the relevance.

Thus, an elementa Subscript i jai j of the adjacency matrixupper AA gives the relevance ofii toj j. This may

not be the same as a Subscript j ia ji, giving the relevance of j j to ii; hence, the graph is a directed

one. On the other hand, association factors such asupper P left parenthesis i j right parenthesis divided by upper P left parenthesis i right parenthesis upper P left parenthesis j right parenthesisP(i j)/P(i)P( j) (the probability

of the joint occurrences divided by the product of the probabilities of the separate

occurrences) are symmetrical. The degree of clumpiness of a group of nodes could

then be given by summing the elements of the adjacency matrix of the group and

dividing by the number of elements in the group; a clump could be considered as

complete if the addition of an extra node would bring the clumpiness below some

threshold.

Possibly, it is useful to use the term “clustering” for the formal process (which

can be carried out on a computer) described in Sect. 13.2.1 and the term “clumping”

for a more general process (of which clustering would be a subset), for which formal

deﬁnitions might not always be available.

It is possible to conceive a highly automated mode of scientiﬁc investigation, in

which every object in the universe would be parametrized (by which it is meant that

a numerical value is assigned to every attribute). In order to investigate something

more speciﬁcally, the researcher would select the relevant collection of objects (e.g.,

“furry mammals”) and apply some kind of dimensional reduction to the dataset (if

the attributes were chosen from some vast standardized set, many would, of course,

have values of zero for a particular collection), preferably down to two or three, ³

after which a clustering algorithm would be applied. ⁴

13.2.1

Clustering

While supervised pattern recognition (i.e., with a teacher) corresponds to the most

familiar kind of pattern recognition carried out by human beings throughout their

3 For example, using principal component analysis (PCA) (q.v.).

4 See Gorban et al. (2005) for an example.